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The efficient recognition of pathogens by the adaptive immune system relies on the diversity of 
receptors displayed at the surface of immune cells. T-cell receptor diversity results from an initial 
random DNA editing process, called VDJ recombination, followed by functional selection of cells 
according to the interaction of their surface receptors with self and foreign antigenic peptides. To 
quantify the effect of selection on the highly variable elements of the receptor, we apply a probabilistic 
maximum likelihood approach to the analysis of high-throughput sequence data from the /3-chain 
of human T-cell receptors. We quantify selection factors for V and J gene choice, and for the length 
and amino-acid composition of the variable region. Our approach is necessary to disentangle the 
effects of selection from biases inherent in the recombination process. Inferred selection factors diffcr 
little between donors, or between naive and memory repertoires. The number of sequences shared 
between donors is well-predicted by the model, indicating a purely stochastic origin of such "public" 
sequences. We find a significant correlation between biases induced by VDJ recombination and our 
inferred selection factors, together with a reduction of diversity during selection. Both effects suggest 
that natural selection acting on the recombination process has anticipated the selection pressures 
experienced during somatic evolution. 



Significance statement 

The immune system defends against pathogens via 
a diverse population of T-cells that display different 
antigen recognition surface receptor proteins. Recep- 
tor diversity is produced by an initial random gene 
recombination process, followed by selection for a de- 
sirable range of peptide binding. Although recombina- 
tion is well-understood, selection has not been quan- 
titatively characterized. By combining high through- 
put sequencing data with modeling, we quantify the 
selection pressure that shapes functional repertoires. 
Selection is found to vary little between individuals or 
between the naive and memory repertoires. It rein- 
forces the biases of the recombination process, mean- 
ing that sequences more likely to be produced are also 
more likely to pass selection. The model accounts for 
"public" sequences shared between individuals as re- 
sulting from pure chance. 



The T-cell response of the adaptive immune system be- 
gins when receptor proteins on the surface of these cells 
recognize a pathogen peptide presented by an antigen 
presenting cell. The immune cell repertoire of a given 
individual is comprised of many clones, each with a dis- 
tinct surface receptor. This diversity, which is central to 
the ability of the immune system to defeat pathogens, is 
initially created by a stochastic process of germline DNA 
editing (called VDJ recombination) that gives each new 
immune cell a unique surface receptor gene. This initial 
repertoire is subsequently modified by selective forces, 
including thymic selection against excessive (or insuffi- 
cient) recognition of self proteins, that are also stochas- 



tic in nature. Due to this stochasticity and the large 
T-cell diversity, these repertoires are best described by 
probability distributions. In this paper we apply a prob- 
abilistic approach to sequence data to obtain quantita- 
tive measures of the selection pressures that shape T-cell 
receptor repertoires. 

New receptor genes are formed by choosing at random 
from a set of genomic templates for several sub-regions 
(V, D and J) of the complete gene. Insertion and dele- 
tion of nucleotides in the junctional regions between the 
V and D and D and J genes greatly enhances diversity 
beyond pure VDJ combinatorics [1]. The variable region 
of the gene lies between the last amino acids of the V 
segment and the beginning of the J segment; it codes 
for the Complementarity Determining Region 3 (CDR3) 
loop of the receptor protein, a region known to be func- 
tionally important in recognition [2]. Previous studies 
have shown that immune cell receptors are not uniform 
in terms of VDJ gene segment usage [3-6] , or probabil- 
ity of generation [1] , and that certain receptors are more 
likely than others to be shared by different individuals 
[4, 7]. In other words, the statistical properties of the 
immune repertoire are rather complex, and their accu- 
rate determination requires sophisticated methods. 

Over the past few years, advances in sequencing tech- 
nology have made it possible to sample the T-cell re- 
ceptor diversity of individual subjects in great depth [8], 
and this has in turn led to the development of sequence 
statistics-based approaches to the study of immune cell 
diversity [9, 10]. In particular, we recently quantita- 
tively characterized the primary, pre-selection diversity 
of the human T-cell repertoire by learning the probabilis- 
tic rules of VDJ recombination from out-of-frame DNA 
sequences that cannot be subject to functional selection 
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and whose statistics can reflect only the recombination 
process [1]. After generation, T-cells undergo a somatic 
selection process in the thymus [11] and later in the pe- 
riphery [12]. Cells that pass thymic selection enter the 
peripheral repertoire as 'naive' T-cells, and the subset of 
naive cells that eventually engage in an immune response 
will survive as a long-lived 'memory' pool. Even though 
we now understand the statistical properties of the ini- 
tial repertoire of immune receptors [1], and despite some 
theoretical studies of thymic selection at the molecular 
level [13, 14], a quantitative understanding of how selec- 
tion modifies those statistics to produce the naive and 
memory repertoires is lacking. 

In this paper, we build on our understanding of the 
primitive pre-selection distribution of T-cell receptors to 
derive a statistical method for identifying and quantify- 
ing selection pressures in the adaptive immune system. 
We apply this method to naive and memory DNA se- 
quences of human T-cell j3 chains obtained from periph- 
eral blood samples of nine healthy individuals (Fig. 1). 
Our goal is to characterize the likelihood that any given 
sequence, once generated, will survive selection. Our 
analysis reveals strong and reproducible signatures of se- 
lection on specific amino acids in the CDR3 sequence, and 
on the usage of V and J genes. Most strikingly, we find 
significant correlation between the primitive generation 
probability of a sequence and the probability it will pass 
selection. This suggests that natural selection, which acts 
on very long time scales to shape the generation mecha- 
nism itself, may have tuned it to anticipate somatic se- 
lection, which acts on single cells throughout the lifetime 
of an individual. The quantitative features of selection 
inferred from our model vary very little between donors, 
indicating that these features are universal. In addition, 
our measures of selection pressure on the memory and 
naive repertoires are statistically indistinguishable, con- 
sistent with the hypothesis that the memory pool is a 
random subsample of the naive pool. 



I. METHODS 

We analyzed human CD4+ T-cell /3-chain DNA se- 
quence reads (60 or 101 nucleotides long) centered around 
the CDR3 region. T-cells were obtained from nine in- 
dividuals and sorted into naive (CD45RO-) and mem- 
ory (CD45RO+) subsets, yielding datasets of -200,000 
unique naive and —120,000 unique memory sequences per 
individual, on average. The datasets are the same as 
those used in [1] and were obtained by previously de- 
scribed methods [15, 16]. 

In [1] we used the "nonproductive" sequences (those 
where either the J sequences are out of frame, or the 
CDR3 sequences have a stop codon) to characterize the 
receptor generation process. The result of that analysis 
was an evaluation of the probability P pr c(<?) that a VDJ 
recombination event will produce a /3-chain gene consis- 
tent with the specific DNA sequence read a. In this study 



we focus instead on the in-frame, productive, sequences 
(from both the naive and the memory repertoires) with 
the goal of quantifying how the post-selection probabil- 
ity distribution on sequences is modified from the origi- 
nal distribution P prc (a). In what follows we distinguish 
between the read a and the CDR3 region r, the latter 
defined to run from a conserved cysteine near the end of 
the V segment to the last amino acid of the read (leaving 
two amino acids to the conserved Phe) . The CDR3 amino 
acid sequence can be uniquely read off from each in-frame 
sequence read; by contrast, the specific V- and J-genes 
responsible for the read may not be uniquely identifiable 
(because of the relatively short read length). An unam- 
biguous selection effect can be seen by comparing the 
length distribution of CDR3 regions between the pre- 
selection ensemble and the naive, or memory, datasets 
(Fig. 2A): sequences that are longer or shorter than the 
mean are suppressed resulting in a more peaked distri- 
bution. 

For each receptor sequence, we define a selection fac- 
tor Q(a) that quantifies whether selection (thymic se- 
lection or later selection in the periphery) has enriched 
or impoverished the frequency of a compared to the 
pre-selection ensemble. Since the generation probabil- 
ity of sequences varies over many orders of magnitude, 
such a comparison is the only way to define selection 
strength. Denoting by P p0 st(o ! ) the distribution of se- 
quences in the selected naive or memory pools, we will 
set -P P ost(<?) = Q(&)Pprc(&)- Due to the large number of 
possible sequences, we cannot sample the post-selection 
probability P pos t for each sequence directly from the 
data; we need a reduced complexity model to estimate 
it. We propose a simple model, summarized in Fig. 1A, 
that we we will show captures the main features of selec- 
tion: 

Q(T ' F ' J) = F prc (f,y,J) =Z qL qvJ 

(1) 

where V and J denote the choice of V and J segments 
in the sequence a, L is the amino-acid length of the 
CDR3 specified by the read, (n, . . . , t 3 l) is CDR3 nu- 
cleotide sequence, and [ax, ■ • ■ iCll) its amino-acid se- 
quence. The factors q^, (fox, (a) and qvj denote, respec- 
tively, selective pressures on the CDR3 length, its com- 
position, and the associated VJ identities. Note that 
the D segment is entirely included in this junctional re- 
gion, so selection acting on it is encoded in the qi-L 
factors. Z enforces the model normalization condition 

T,r,V,jQ{^ J ) P W^,V,J) = l. 

It is important to understand why we do not write 
Q directly as a function of the read a. While (r, V, J) 
determines a and a determines r, V and J cannot always 
be inferred deterministically from the read a. The VJ 
assignment of any given read will have to be treated as 
probabilistically defined hidden variables. In addition, 
because of correlations in P prc , the q factors cannot be 
identified with marginal enrichment factors (so that, for 
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FIG. 1: A. T-cell receptor /3 chain sequences are formed dur- 
ing VDJ recombination. Sequences from this probability dis- 
tribution, described by P pre , are then selected with a factor 
Q defined for each sequence, resulting in the observed P p0 st 
distribution of receptor sequences. Selection is assumed to 
act independently on the V and J genes, the length of the 
CDR3 region and each of the amino acids, a;, therein. B. 
A schematic of the fitting procedure: the parameters are set 
so that Ppost fits the marginal frequencies of amino acids at 
each position, the distribution of CDR3 lengths and VJ gene 
choices. Since the latter is not known unambiguously from 
the observed sequences, it is estimated probabilistically using 
the model itself in an iterative procedure. 



example, Pi ; L,data(ai)/P>;.L,pre(ai), cannot be set equal to 
qi;L(di))- For all these reasons, we must use a maximum 
likelihood procedure to learn the q^, q^L and qy,j factors 
of Eq. 1. We use an expectation maximization algorithm 
(EM) that iteratively modifies the q's until the observed 
marginal frequencies (for CDR3 length, amino acid usage 
as a function of CDR3 position, and VJ usage) in the 
data match those implied by the model distribution Eq. 1 
(the pre-selection distribution P pre being taken as a fixed, 
known, input). The procedure is schematically depicted 
in Fig. IB (see the appendices for full details) . 

One important assumption of the model is that se- 
lection factors act independently of each other on the 
sequence. Consequently, while the model is fit only to 
single point marginal frequencies, and not to pairwisc 
frequencies. To check the validity of this assumption, 
we plot the correlation functions of amino acid pairs in 
the model post-selection repertoire vs the observed naive 
ones (Fig2B). These pairwise correlations are well pre- 



dicted, even though they are not model inputs. R is 
also noteworthy that they are nonzero, even though the 
selection model does not take into account the possibil- 
ity of interactions in the selection factors qi-L- This is 
because the pre-selection distribution does not factorize 
over amino acids in the CDR3 region, and has correla- 
tions of its own, as shown by the green points of Fig 2B 
(note that these pre-selection correlations do not agree 
well with those observed in the post-selection data). 

Another assumption of our model is that selection acts 
at the level of the amino acid sequence, regardless of the 
underlying codons. To test this, we learned more general 
models where a represented one of the possible 61 codons, 
instead of one of the 20 amino acids. We found that 
codons coding for the same residue had similar selection 
factors (see Fig. 8), except near the edges of the CDR3 
where amino acids may actually come from genomic V 
and J segments and reflect their codon biases. 

To compare the different donors, we learned a distinct 
model for each donor, as well as a "universal" model for 
all sequences of a given type from all donors taken to- 
gether (see the appendices for details). We also learned 
models from random subsets of the sequence dataset to 
assess the effects of low-number statistical noise. 



II. RESULTS 

A. Characteristics of selection and repertoire 
variability 

The length, single-residue, and VJ selection factors, 
learned from the naive datasets of all donors taken to- 
gether, are presented in Fig. 2A,C,D. The qvj distribu- 
tion shows that the different V and J genes are subject to 
a wide range of selection factors (note that these factors 
act in addition to the quite varied gene segment usage 
probabilities in P pre (<?)). We looked for correlations be- 
tween the selection factors qi-h{o) on amino acids and 
a variety of amino-acid biochemical properties [17]: hy- 
drophobicity, charge, pH, polarity, volume, and propensi- 
ties to be found in a or /3 structures, in turns, at the sur- 
face of a binding interface, on the rim or in the core [18] 
(see the appendices for details and references) . We found 
no significant correlations, save for a negative correlation 
with amino acid volume and a helix association, as well 
as a positive correlation with the propensities to be in 
turns or in the core of an interacting complex (Fig. 13). 

To estimate differences between datasets, we calcu- 
lated the correlation coefficients between the logs of the 
qvj and qi-L(o>) selection factors (see Fig. 10). Compar- 
ing naive vs. naive, memory vs. memory or naive vs. 
memory between donors (see Fig. 3A-C for an example 
for qi-L, and Fig. 9 for qvj) gave correlation coefficients 
of R3 0.9 in log(?i ; £, while the naive vs. memory reper- 
toires of the same donor gave 0.95. To get a lower bound 
on small-number statistical noise, we also compared the 
factors inferred from artificial datasets obtained by ran- 
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FIG. 3: A.-C. Variability between repertoires. The scatter 
between q^L selection factors between two sample individuals 
A and B for naive (A) and memory repertoires (B) compared 
to that of memory and naive repertoires for the same individ- 
ual (C) shows great similarity between them. See also Fig. 11. 
D. The entropy of the pre-selection repertoire (top) is reduced 
in the post-selection repertoire (bottom). E.-F. Distribution 
of VJ (E) and DJ (F) insertions in the pre-selection and 
naive repertoires shows elimination of long insertions. Error 
bars show standard deviations over 9 donors. The insertion 
distributions for the memory repertoire are the same as for 
the naive repertoire (see scatter plots in insets). 



FIG. 2: A. CDR3 length distributions, pre- and post-selection 
and the length selection factor qL (green). Selection makes 
the length-distribution of CDR3 regions in the pre-selection 
repertoire more peaked for the naive and memory repertoires 
(overlapping) . Error bars show standard variation over 9 indi- 
viduals. B. Comparison between data and model of the con- 
nected pairwise correlation functions, which were not fitted by 
our model. The excellent agreement validates the inference 
procedure. As a control, the prediction from the pre-selection 
model (in gray) does not agree with the data as well. C. 
Values of the inferred amino-acid selection factors for each 
amino acid, ordered by length of the CDR3 region (ordinate) 
and position in the region (abscissa). D. Values of the VJ 
gene selection factors. 



domly shuffling sequences between donors (see the appen- 
dices), yielding an average correlation coefficient of 0.98. 
Repeating the analysis for logtjyj, we found correlation 
coefficients of ~ 0.8 between datasets of different donors, 
0.84 for the naive and memory dataset of the same donor, 
all of which must be compared to 0.94 obtained between 



shuffled datasets. Thus, the observed variability between 
donors of q^x and qvj are small, and consistent with 
their expected statistical variability. 

We use Shannon entropy, S = 
_ Eo ! - P post(o : )log2-Ppost(<?), to quantify the diver- 
sity of the naive and memory distributions. Entropy is 
a diversity measure that accounts for non-uniformity 
of the distribution and is additive in independent 
components. Since S = log 2 Q when there are Q equally 
likely outcomes, the diversity index 2 can be viewed as 
an effective number of states. The entropy of the naive 
repertoire according to the model is 38 bits (correspond- 
ing to a diversity of ~ 3.0 • 10 11 ) down from 43.5 bits 
in the primitive, pre-selection repertoire (Fig. 3D). This 
is a reduction of ~ 6 bits, or 50-fold in diversity. The 
majority of the reduction comes from insertions and 
deletions, which accounted for most of the diversity 
in the pre-selection repertoire. The entropies of the 
memory and naive repertoires are the same, indicating 
that selection in the periphery does not further reduce 
diversity. 
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FIG. 4: Probability of passing selection. A.-B. Ratio of the 
distributions of sequence-wide selection factors Q between the 
observed sequences and the pre-selection ensemble (red line), 
plotted as a function of Q for naive (A.) and memory (B.) 
repertoires. The model prediction P pos t(Q) / P pre {Q) = Q is 
shown in black, and the pre-selection and observed distribu- 
tions of Q are shown in the insets. The selection ratio saturate 
around « 7, which may be interpreted as the maximum prob- 
ability of being selected. Naive and memory repertoires show 
similar behaviors. C. A cartoon of the effective selection land- 
scape captured by our model (red line). Our method does not 
capture localized selection pressures (such as avoiding self) 
specific to each individual, but captures general global prop- 
erties. 



Knowing the post-selection distribution of sequences, 
we can ask how different features of the recombination 
scenario fare in the face of selection. This does not imply 
that selection acts on the scenarios themselves — it acts 
on the final product — but it is an a posteriori assessment 
of the fitness of particular rearrangements. For example, 
the distributions of insertions at VD and DJ junctions in 
the post-selection ensemble have shorter tails (Fig. 3E- 
F), while the distribution of deletions at the junctions 
seems little affected by selection (Fig. 11), although large 
numbers of deletions are selected against. 



By construction, the distribution of Q in the post- 
selection model satisfies exactly P pos t(Q) = QPpxe(Q)- 
In Fig. 4A-B we plot the ratio Pdata(<3)/Pprc(Q) as a 
function of Q, both for the naive and memory models 
learned from all donors. We observe that for Q < 5, i.e. 
for > 90% of sequences, this ratio is exactly equal to Q — 
a validation of our model prediction at the sequence-wide 
level. For larger values of Q however, this ratio saturates 
to around Q max ~ 7. 

This plateau may be viewed as a limiting value, above 
which selection is insensitive to Q. A similar plateau 
was observed in the fitness of transcription factor binding 
sites below a certain binding energy [20] . In the case con- 
sidered here, the plateau can be rationalized if we assume 
that Q is proportional to the probability for a sequence 
to be selected, P se i(<?) = aQ{a). Since P se \ cannot exceed 
one, Q cannot exceed a" 1 . The average probability of se- 
lection is given by Ppre(&)P se i(a) = a. The observed 
plateau gives a lower bound to the true maximum of Q: 
or 1 > Q m ax, and thus the average fraction of cells to 
pass selection satisfies a < 15%. This can be compared 
to estimates [2] for passing positive and negative thymic 
selection: 10 — 30% for positive selection only, and sa 5% 
for both. This analysis only includes the /? chain, and 
including the a chain could further reduce our estimate. 

The saturation also seems to indicate that our model 
may be too simple to describe the very fit (high Q) se- 
quences. Because of its fairly simple factorized structure, 
our model can only account for the coarse features of 
selection, and may not capture very individual-specific 
traits such as avoidance of the self (corresponding to 
Q <C 1 in localized regions of the sequence space) or 
response to pathogens (Q 1 for particular sequences). 
This individual-dependent ruggedness of the fitness land- 
scape Q, schematized in Fig. 4C, is probably ignored by 
our description, and may be hard to model in general. 

To check that the saturation does not affect our infer- 
ence procedure, we relearned our model parameters from 
simulated data, where sequences were generated from 
P pre and then selected with probability min(Q/Q m ax> 1) 
(see the appendices for details), and we found that the 
model was correctly recovered (Fig. 12). 



B. Selection factor as a measure of fitness 

The selection factor Q is a proxy for the probability 
of a particular sequence to be selected or amplified, and 
sequences with large Q values should thus be enriched in 
the observed dataset. To test this, we consider the dis- 
tributions of Q both in the pre-selection model, P prc {Q), 
and in the dataset from which Q was learned, Pdata(Q) 
(insets of Fig. 4; see the appendices for details on how 
the distributions are calculated when V and J are hidden 
variables). This approach is very similar to the one used 
by Mustonen et al. [19, 20] to characterize the fitness 
landscape of transcription factor binding sites. 



C. Natural selection anticipates somatic selection 

Comparing the pre- and post-selection length distri- 
butions in Fig. 2A shows that the CDR3 lengths that 
were the most probable to be produced by recombina- 
tion are also more likely to be selected. Formally, Spear- 
man's rank correlation coefficient between P pre (L) and 
ql is 0.76, showing good correlation between the prob- 
ability of a CDR3 length and the corresponding selec- 
tion factor. We asked whether this correlation was also 
present in the other sequence features. The histogram 
of Spearman's correlation between the selection factors 
<li;L{a) and the pre-selection amino-acid usage Pi;L iPre (a) 
for different lengths and positions (i, L) (Fig. 5A) shows 
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FIG. 5: Correlations between the pre- and post-selection 
repertoires. A. A histogram of Spearman's correlation coeffi- 
cient values between the <7; ; l(<i) selection factors in the CDR3 
region and their generation probabilities Pi-L, P re(a) for all i, L 
shows an abundance of positive correlations. B. Heatmap of 
the joint distribution of the pre-selection probability distri- 
bution P prG and selection factors Q for each sequence shows 
the two quantities are correlated. C. Sequences in the ob- 
served, selected repertoire (green line) had a higher probabil- 
ity to have been generated by recombination than unselected 
sequences (blue line). Agreement between the post-selection 
model (red line) and data distribution (green line) is a vali- 
dation of the model. 



a clear majority of positive correlations. Likewise, the 
selection factors qyj are positively correlated with the 
pre-selection VJ usage Pyj lPre (Spearman's rank correla- 
tion 0.3, p < 2 • 1CT 20 ). 

The correlations observed for each particular feature 
of the sequence (CDR3 length, amino acid composition 
and VJ usage) combine to create a global correlation 
between the probability P prc ((?) that a sequence a was 
generated by recombination, and its propensity Q(a) to 
be selected (Spearman's rank correlation 0.4, p = 0, 
see Fig. 5B). Consistent with this observation, the post- 
selection repertoire is enriched in sequences that have a 
high probability P pre (cr) to be produced by recombina- 
tion (Fig. 5C). This enrichment is well predicted by the 
model, providing another validation of its predictions at 
the sequence- wide level. 

Taken together, these results suggest that the mech- 
anism of VDJ recombination (including insertions and 
deletions) has evolved to preferentially produce sequences 
that are more likely to be selected by thymic or periph- 
eral selection. 



D. Shared sequences between individuals 

The observation of unique sequences that are shared 
between different donors has suggested that these se- 
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FIG. 6: Shared sequences between individuals. A. The mean 
number of shared sequences between any pair of individuals 
compared to the number expected by chance (model predic- 
tion) for one common model for all individuals (red crosses) 
and private models learned independently for each individual 
(blue crosses). Error bars are standard deviations from dis- 
tributions over pairs. The distribution of shared sequences 
between triplets (B.) and quadruplets (C.) of individuals 
for the data (black histogram), from common (red line) and 
private models (blue line). D. The shared sequences are 
most likely to be generated and selected: comparison of the 
Ppost post-selection distribution for sequences from the pre- 
selection (dotted line), and post-selection repertoires (accord- 
ing to the model in gray, and to the data in black), as well as 
the sequences shared by at least two donors (model prediction 
in magenta, data in red). 



quences make up a "public" repertoire common to many 
individuals, formed through convergent evolution or a 
common source. However, it is also possible that these 
common sequences are just statistically more frequent, 
and are likely to be randomly recombincd in two individ- 
uals independently, as previously discussed by Venturi et 
al. [6, 7, 21]. In other words, public sequences could just 
be chance events. Here we revisit this question by asking 
whether the number of observed shared sequences be- 
tween individuals is consistent with random choice from 
our inferred sequence distribution P p0 st- 

We estimated the expected number of shared sequences 
between groups of donors in two ways: (i) by assuming 
that each donor a had its own "private" model learned 
from his own sequences or (ii) by assuming that sequences 
are drawn from a "universal" model learned from all 
sequences together. While the latter ignores small yet 
perhaps significant differences between the donors, the 
former may exaggerate them where statistics are poor. 
For details on how these estimates are obtained from the 
models, we refer the reader to the appendices. In Fig. 6A 
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we plot, for each pair of donors, the expected number of 
shared nucleotide sequences in their naive repertoires un- 
der assumptions (i) and (ii), versus the observed number. 
The number is well predicted under both assumptions, 
the universal model assumption giving a slight overes- 
timate, and the private model giving a slight underes- 
timate. We repeat the analysis for sequences that are 
observed to be common to at least three or at least four 
donors (Fig. 6B-C) . The universal model predicts their 
number better than the private models, although it still 
slightly overestimates it. 

These results suggest that shared sequences are indeed 
the result of pure chance. If that is so, shared sequences 
should have a higher occurrence probability than average; 
specifically, the model predicts that the sequences that 
are shared between at least two donors are distributed 
according to Pp OS t(&) 2 (see the appendices). We test this 
by plotting the distribution of P post for regular sequences, 
as well as for pairwise-shared sequences, according to the 
model and in the naive datasets (Fig. 6D), and find ex- 
cellent agreement. In general, sequences that are shared 
between at least n individuals by chance should be dis- 
tributed according to P pos t (<?)"• For triplets and quadru- 
plets, this model prediction is not as well verified (see 
Fig. 14). This discrepancy may be explained by the fact 
that such sequences are outliers with very high occur- 
rence probabilities, and may not be well captured by the 
model, which was learned on typical sequences. 

We repeated these analyses for sequences shared be- 
tween the memory repertoires of different individuals, 
with very similar conclusions, except for donors 2 and 
3, and donors 2 and 7, who shared many more sequences 
than expected by chance (see Fig. 15). We conclude that 
the vast majority of shared sequences occur by chance, 
and are well predicted by our model of random recombi- 
nation and selection. 



III. DISCUSSION 

We have introduced and calculated a selection factor 
Qip) that serves as a measure of selection acting on a 
given receptor sequence a in the somatic evolution of the 
immune repertoire. Our approach accounts for the fact 
that the pre-selection probabilities of sequences vary over 
orders of magnitude. 

Using this measure, we show that the observed reper- 
toires have undergone significant selection starting from 
the initial repertoire produced by VDJ recombination. 
We find little difference between the naive and memory 
repertoires, in agreement with recent findings showing 
no correlation between receptor and T-cell fate [22], as 
well as between the repertoires of different donors. This 
is perhaps surprising, because the donors have distinct 
HLA types (which determine the interaction between 
T-cell receptors and peptide-MHC complexes), and we 
could expect their positive and negative selective pres- 
sures to be markedly different. Besides, memory se- 



quences have undergone an additional layer of selection 
compared to the naive ones — recognizing a pathogen — 
and we could also expect to see different signatures of 
selection there. A possible interpretation is that our 
model only captures coarse and universal features of se- 
lection related to the general fitness of receptors, and 
not the fine-grained, individual-specific selective pres- 
sures such as avoidance of the self, as illustrated schemat- 
ically in Fig. 4C. In other words, our selection factors may 
"smooth out" the complex landscapes of specific reper- 
toires and fail to capture their rough local properties, 
such as would be expected from specific epitopes that 
would correspond to very tall peaks or deep valleys in 
the landscape of selection factors. To really probe these 
specific deep valleys, we need to develop methods based 
on accurate sequence counts. Another interesting future 
direction would be to see whether at this global level 
the signatures of selection are similar between (relatively) 
isolated populations. Lastly, comparing data from differ- 
ent species (mice, fish), in particular where inbred indi- 
viduals with the same HLA type can be compared, would 
be an interesting avenue for addressing these issues. 

Our results suggest that natural selection has refined 
the generation process over evolutionary time scales to 
produce a pre-selection repertoire that anticipates the 
actions of selection. Sequences that are likely to be elim- 
inated and fail selection are not very likely to be produced 
in the first place. Because of this "rich become richer" ef- 
fect, the diversity of the repertoire is significantly reduced 
by selection, by a 50-fold factor in terms of diversity in- 
dex. This does not mean that only 2% of the sequences 
pass selection. In fact, our results are consistent with 
as much as 15% of sequences passing selection. This ap- 
parent paradox is resolved by noting that selection, by 
keeping clones that were likely to be generated, get rids 
of very rare clones that contributed to the large initial 
diversity. 

Although we did observe sequences that were present 
in the repertoires of different donors, we showed using our 
model that their number was broadly compatible with 
that expected by pure chance. This suggests that the 
"public" part of the repertoire is made of sequences that 
are just more likely to be randomly produced and se- 
lected. 

To summarize, our work clearly shows that thymic se- 
lection and later peripheral selection modify the form of 
the generated repertoire. Our work is a starting point for 
a description of a mechanism of the two processes. 
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APPENDIX A: DATA 

The DNA nucleotide data used in our analysis con- 
sists of human CD4+ naive (CD45RO-) or memory 
(CD45RO+) /3 chain sequences from 9 healthy indi- 
viduals, sequenced and made available to us by H. 
Robins and already used in [1]. Reads are 60 base 
pair long for 6 donors and 101 base pair long for 3 
donors (individuals 2, 3 and 7) and contain the CDR3 
region and neighboring V and J gene nucleotides. All 
end at the same position in the J gene, with four nu- 
cleotides between this position and the first nucleotide 
of the conserved phenylalanine. The data were divided 
into out-of-frame reads (non-coding), used to learn the 
pre-selection model as described in [1] and in-frame 
(coding) reads used in the analysis presented in this 
paper. The sequence data we used are available at 
http : //princeton. edu/~ccallan/TCRPaper/data/. 

In our study we limit ourselves to unique sequences. 
The experimental procedure and initial assessment of the 
quality of the reads were done in the Robins lab following 
the procedures described in [15, 23]. Each sequence was 
read multiple times, allowing for the correction of most 
sequencing errors. The numbers of unique sequences used 
in each dataset is shown in Table SI. 





Naivo 


Memory 


Donor 1 


311917 


177744 


Donor 2 


242254 


135567 


Donor 3 


195007 


119906 


Donor 4 


130958 


142017 


Donor 5 


147848 


32468 


Donor 6 


187245 


104119 


Donor 7 


251335 


136419 


Donor 8 


42326 


120527 


Donor 9 


254349 


89830 



TABLE I: Number of unique coding sequences in each 
datasets. 

The alignment to all possible V and J genes was done 
using the curated datasets in the IMGT database [24]. 
There are 48 V genes, 2 D genes and 13 J genes plus a 
number of pseudo V genes that cannot lead to a function- 
ing receptor due to stop codons. We discarded sequences 
that were associated to a pseudo-gene as our model only 
accounts for coding genes. The germline sequences of the 
genes used in our analysis are the same as were used in 
[1] to analyze the generative V(D)J recombination pro- 
cess. The complete list of gene sequences can be found at 
http : //princeton. edu/~ccallan/TCRPaper /genes/. 

APPENDIX B: PRE-SELECTION MODEL 

The pre-selection, or generative model, assumes the 
following structure for the probability distribution of re- 



combination scenarios S [1]: 

P pl . c (S) =P{V)P{D, J)P(insVD)P(insDJ) 

P(delV|y)P(dellD,delrD|Z))P(delJ|J) 
P{s x )P{s 2 \s x )---P{ 

SinsVD| s insVD-l) 
P(tl)P{t 2 \h) ■ ■ -P^insDjItinsDJ-l), 

where a scenario is given by the VDJ choice, the 
number of insertions insVD, insDJ and the num- 
ber of deletions (delV,dellD), (delrD,delJ) at each 
of the two junctions, together with the identi- 
ties (si, . . . , SinsVD),(*i) • • • jtinsDj) OI the inserted nu- 
cleotides. It is worth noting that the insertions are as- 
sumed to be independent of the identities of the genes 
between which insertions are made. By contrast, the 
deletion probabilities are allowed to depend on the iden- 
tity of the gene being deleted. The validity of these as- 
sumptions is verified a posteriori. 



APPENDIX C: MODEL FITTING 

1. Maximum likelihood formulation 

The model probability to observe a given coding nu- 
cleotide sequence is: 

P pos t(r, V, J) = Q(t, V, J)P prc (r, V, J), (CI) 

where r = (r 1; . . . , t^l) is the nucleotide sequence of the 
CDR3 (defined as running from the conserved cysteine 
in the V segment up to the last amino acid in the read, 
leaving two amino acids between the last read amino acid 
and the conserved phenylalanine in the J segment), L is 
the length of the CDR3, and V and J index the choice 
of the germline V and J segments (which completely de- 
termine the sequence outside the CDR3 region). The D 
segment is entirely absorved into r, and is not explicitly 
tracked in assessing selection. The selection factor Q is 
assumed to take the following factorized form: 

1 L 
Q(f, V,J) = - q L qv,.r Y[qi;L(ai). (C2) 

i=l 

where a = (aj., . . . , a^) is the amino-acid sequence of the 
CDR3, and Z is a normalization constant that enforces 

P P ont(r,V,J) = l. (C3) 

T,V,J 

The probability, P pre (r , V, J) , of generating a specific 
sequence in a V(D)J recombination event can be ob- 
tained from the noncoding sequence reads by the meth- 
ods explained in [1] . Specifically, the pre-selection model 
gives the probability P pl - c (S) of a recombination scenario 
S = (V, D, J, insVD, insDJ, delV, . . .) as given by Eq. Bl. 
A scenario S completely determines the sequence r, but 
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FIG. 7: Summary of the notations used in this paper for the 
sequences. The CDR3 region is defined from the conserved 
cysteine around the end of the V segment to the last amino- 
acid in the read, leaving two amino acids to the conserved 
phenylalanine in the J segment. The nucleotides in the read 
are defined as Oi, the nucleotides in the CDR3 region as r; 
and the amino acids in the CDR3 region as dj. The data 
sequences therefore can be defined in terms of a, or their V, 
J genes and f. The generated sequences, with known V and 
J genes, are defined in terms of £ for the whole sequence or p 
for only the CDR3. 



the converse is not true. The pre-selection probability 
for a coding sequence is thus given by 



Pcodii 



PpUS) (C4) 



where we sum over scenarios resulting in a particular 
CDR3 sequence r and a particular V, J pair. The nor- 
malization factor ^coding ~ 0.26 corrects for the fact that 
a randomly generated sequence is not always productive 
(i.e. in- frame and with no stop codon). From this point 
on, we regard the initial generation probability of any 
specific read as known. When we make statements about 
the pre-selection distribution of CDR3 properties, such as 
length or amino acid utilization, they are derived from 
synthetic repertoires drawn from the above pre-selection 
distribution. 

We want to infer the parameters qL, qv,J arL d of 
the model from the observed coding sequence repertoires. 
Formally we want to maximize the likelihood of the data 
given the model. Unfortunately the sequence reads from 
the data are not long enough to fully specify the V and 
J segments, so we cannot use P pos t (t, V, J) as our raw 
likelihood. Instead, we need to write the probability of 
observing a given (truncated) read cr, of length 60 or 101 



nucleotides, depending on the donor: 



P P o S t(T,V,J). (C5) 



Ppost (8) = ^ P, 

where we note again that (r, V, J) fully specifies er, while 
(7 fully specifies r, but not V and J. Given a dataset of 
N sequences, ct 1 , . . . , <j N (see Fig. 7 for notations), the 
likelihood reads: 



N 



£(q) = n p post(<? a ) 



(C6) 



a=l 



Our goal is maximize C with respect to the parameters 
<ZLi Iv.j, and 9i ; i(-) (globally refered to as Q). 



2. Expectation maximization 

Calculating P pos t(o ; ) is computationally intensive. 
Given the form of the model, it seems more natural to 
work with P pos t (r, V 1 J) , but this likelihood involves the 
"hidden" variables V and J. To circumvent this problem, 
we use the expectation maximization algorithm [25, 26]. 
This algorithm uses an iterative two-step process, with 
two sets of model parameters Q and Q' . The log- 
likelihood of the data is calculated using the set of param- 
eters Q'; in the "Expectation" step, this log-likelihood is 
averaged over the hidden variables with their posterior 
probabilities, which are calculated using the second set 
of parameters Q. In the "Maximization" step, this av- 
erage log-likelihood is maximized over the first set Q' , 
while keeping the second set Q fixed. Then Q is updated 
to the optimal value of Q' ', and the two steps are repeated 
iteratively until convergence. 

In practice, starting with a test set of parameters Q, 
we calculate, for each sequence of the data, the posterior 
probability of a (V, J) pair: 



Ppost 04, Ja\v a ) 



Q(T a ,V a ,Jg)P p re(? a ,Va,Ja) 

Ev,jQ^ a ,V, J)P pte (r a ,V,J)- 



(C7) 

The log-likelihood, expressed in terms of the hidden vari- 
ables V and J, is maximized after averaging over V and 
J using that posterior. Specifically we will maximize: 



N 



£{Q'\Q) = E (l0gPpost(^, V a , Ja,Q')) c 



N 



P P ost(V a ,J a \a a ;Q)\ogP post (T a ,V a ,J a ;Q'). 

(C8) 



a=l V,J a 



Here we have added the Q dependencies explicitly be- 
cause there are two different parameter sets Q and 
Q' . The maximization is performed over Q', which 
parametrizes the log-likelihood itself, while keeping Q, 
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which parametrizes how the average is done over the hid- 
den variables, constant. After each maximization step we 
substitute: 



Q <- argmaxQ,£(Q'|(9), 



(C9) 



and iterate until convergence. This procedure is guaran- 
teed to find a local maximum of the likelihood C(Q). 



3. Equivalence with fitting marginal probabilities 

The expectation-maximization step can be simplified 
by noting that at the maximum, derivatives vanish: 



dC(Q'\Q) 
dQ' 



0. 



(CIO) 



Precisely, we take derivatives with each of the param- 
eters, qL, qvj etc. and set them to zero. Since 
P post (T, V 1 J) is naturally factorized in the Q parameters, 
we obtain simple expressions, e.g. dC/dlogq' L = 0 gives: 



parameters Q' , which are then varied to achieve equal- 
ity with the data estimate). The approach of iteratively 
adjusting model parameters to match a corresponding 
set of data marginals is a conceptually clear and com- 
putationally effective implementation of the expectation 
maximization algorithm. 



4. Gauge 

As defined above, the model is degenerate: for each 
i,L, the factors <7i ; z,(a) and Z may be multiplied by a 
common constant without affecting the model. We need 
to fix a convention, or gauge, to lift this degeneracy. We 
impose that, for each i,L: 



E Pi;L,prc(a)qi;L(a) = !• 



(C16) 



where Pi ; L ]P re(a) is the probability of having amino-acid 
a at position i in CDR3s of length L. 



N 



E E P Post(Va, Ja\<J a ;Q) (S La , L ~ I = 0. 



a=l V a ,J a 



d\ogq' L 



(CH) 

where 8 a ^ is Kronecker's delta function. The term in the 
sum gives the total number of sequences in the data with 
length L. Besides we have: 

fr^ = E S Lm,LPpost(r,V, J;Q') = P pos t(L;Q')- 
a log (7V 

bHL T,V.J 

(C12) 

Hence the maximality condition simply becomes: 

Pd a ,t a (L) = P pos t(L;Q'), (C13) 

i.e. that the length distribution of the model must be 
equal to that of the data. Similarly, maximizing with 
respect to qi-L^i) entails that single amino-acid frequen- 
cies at a given position are matched between data and 
model: 



P, 



z;L.data 



z;L,post 



(a i; Q')- 



(C14) 



The condition for qyj is slightly different, because we do 
not directly have the frequencies of V and J in the data. 
This is replaced by their expected frequency under the 
posterior P p0 st(V r a , J a \& a ) taken with parameters Q: 



1 N 

-Y 

a=l 



P pos t(V,J\a a ;Q) = P pos t(V,J;Q'), (C15) 



where again the left-hand side is the empirical distribu- 
tion of V and J (indirectly estimated with the help of the 
model with parameters Q) , and the right-hand side is the 
model distribution of the same quantities (estimated with 



5. Numerical implementation 

To solve the fitting equations (C13)-(C15) in practice, 
we use a gradient descent algorithm: 



q L ^q L +e [-Pdata(i) - P pos t{L; Q')] , 



(C17) 



and similarly for q^ and qvj- To do this, we 
must be able to calculate the marginals -P pos t(£; 
Pi;L, P ost{auQ') and P pos t(V, J; Q') from the model at 
each step. 

This leaves us with the problem of estimating 
marginals in the model, which we do using importance 
sampling. Although it is easy to sample sequences from 
-Ppre by picking a random recombination scenario, sam- 
pling from Pp OS t = QP P ic is much harder, as the q^L, q^ 
and qyj factors introduce complex dependencies between 
the different features of the recombination scenario. To 
overcome this issue, we sample a large number M of 
(t, V, J) triplets from P pie (T, V, J), and, when estimating 
Pp OS t expectation values, weight the contribution of each 
sequence with its Q(t, V, J) value (this is a particularly 
simple instance of importance sampling) . The generated 
triplets are denoted by [(p 1 , Vi, Ji), . . . , (p M , V M , J M )], 
and the corresponding reads by (|\ . . . ,S} 4 ) (see Fig. 7 
for notations). The marginal probability distribution of 
lengths, for instance, is estimated by 



E^i^,lQV,H,J 6 ) 



P P ost(L] Q 



(C18) 



and similar expressions give estimates of Pi-,L, P ost(ai', Q') 
and P pos t{V, J; Q'). Since we are optimizing over Q' ', the 
sequences (/r, V&, Jb) can be generated once and for all at 
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FIG. 8: The <?i;i(a) selection factors learned for codons (red crosses) agree with those learned for amino acids (blue). The 
9i ; i(a) are plotted for each position in the CDR3 region (panels from 1 to 12) for naive CDR3 sequences of length 12, as a 
function of the amino acids at each position. A given amino acid at a given position can come from different codons, which are 
marked by multiple crosses at that position. Codons or amino acids for which there was not enough data to infer the selection 
factors are not represented. 




FIG. 9: The scatter of VJ gene selection factors qvj between donors A and B for naive (A) and memory repertoires (B), as 
well as between the memory and naive repertoires of the same individual (C) shows that the memory and naive repertoires 
are statistically similar to each other and across individuals. See Fig. 10 for the correlation analysis of all individuals and cell 
types. 
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A. Correlation coefficients of log gj ; i(a) between datasets 



1=9 1=10 !=(/ t=« L-13 t=M 1=15 



B. Correlation coefficients of log q V J between datasets 




FIG. 10: Correlation coefficients between selection factors ob- 
tained for models learned for different donors and cell type 
(naive and memory). The compared factors are the amino- 
acid selection factors q^L (A) and the VJ gene selection fac- 
tors qv.j (B). Each position along the two axes in each plot 
corresponds to a different individual. The naive dataset of 
donor 8, and the memory dataset of donor 5 were removed 
because of too low statistics. In all heat maps, the x and y 
axes correspond to different donors (l-7;9 for naive, l-4;6-9 
for memory, and 1,2,3,4,6,7,9 for comparison between naive 
and memory). 



the beginning of the algorithm. Then the marginal prob- 
abilities are updated according to the modified Q' using 
Eq. C18. Finally, the normalization constant is evaluated 
by calculating: 



(C19) 



6=1 



so that 



M 

Y, Wr, V, J) « — J2 Q(?> H> Jb) = 1. (C20) 



T,V,J 



6=1 



Equivalence with minimum discriminatory 
information 



The principle of minimum discriminatory information 
is to look for a distribution that reproduces exactly some 
mean observables of the data, such as position-dependent 
amino-acid frequencies, while being minimally biased 
with respect to some background distribution. When 
the background distribution is uniform, this principle is 
equivalent to the principle of maximum entropy. 

Taking P pro as our background distribution, assume 
we are looking for the distribution P pos t that satisfies 
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FIG. 11: The effects of selection on deletion profiles. Distribu- 
tion of V (A), D left-hand side (B), D right-hand side (C), 
and J (D) deletions in the pre-selected (black lin e), naive 
(colored line) and memory (gray dashed line) repertoires. Er- 
ror bars show standard deviation over 9 individuals. Results 
using 9 separate models learned for each of the individuals. 
The deletion distributions for the memory repertoire are the 
same as for the naive repertoire. Selection has a slight effect 
on favoring distributions with non-extreme deletion values of 
deletions for V and J deletions, and does not have a signifi- 
cant effect on D deletions. 



Eqs. (C13)-(C15) while minimizing the divergence or rel- 
ative entropy with respect to P prc , defined as: 



£>KL(Ppo B t||Ppre) = ^ ^post (f , V, J) log 



T,V,J 



Ppostj^V, J) 
P p re(f,V,J) ■ 

Solving this problem is mathematically equivalent 
to solving the maximum likelihood problem described 
above. 



APPENDIX D: INDIVIDUAL, UNIVERSAL AND 
SHUFFLED DONORS 

We partition the data in three different ways to learn 
the model. First, we learn a distinct model for each 
donor, and for each of the naive and memory pools. For 
each donor, we have a distinct P pre learned from the out- 
of-frame sequences of that donor (although in fact they 
differ little from donor to donor as discussed in [1]). Sec- 
ond, we pool all the sequences of a given type (naive 
or memory) from all nine donors together, and learn a 
"universal" or average model. For this we use a mean 
P pre averaged over all nine donors, and then learn Q us- 
ing all sequences. Third, to assess the effect of finite-size 
sampling in the universal model, we partition the data 
from all donors into nine random subsamples of equal 
sizes. This way we can estimate how much variability 
one should expect from just sampling noise. 
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APPENDIX E: ENTROPY, DISTRIBUTIONS OF 



P pro , Ppost AND Q 



13 



To estimate global statistics, such as entropy, from the 



model, we draw a large set of sequences (£ 



a 



from Pprc, and weight them according to the inferred 
(normalized) Q values. Specifically, for each generated 
sequence, we estimate its primitive generation probabil- 
ity by summing over all the possible scenarios that could 
have given rise to it: 



1 



Pcodir 



(El) 



where £ is the full nucleotide sequence, including the 
CDR3 p 6 as well as the V& and Jb segments. The entropy 
(in bits) of the selected sequence repertoire is defined as 

H [P port ] = -£ Pp OS t (<?) log 2 P post (a) (E2) 



and, to include selection effects, we estimate it by 

1 M r 
H [P P ost] ~ E Jfc ) lo S V5, J 6 )P pre (f ) 

6=1 

(E3) 

The distributions of P prc , P pos t and Q over the selected 
sequences are determined from the same draw of M se- 
quences from P pre , weighted by the normalized selection 
factors Q. For example the distribution of logP prc is: 



M 



Og P prc ) « — £ Q(?> V b> J b)5 log Pprc ~ bg Pprc(^ 



M 

6=1 

(E4) 

Marginal distributions over pairs of amino-acids 
(ai,a,j) at two positions i and j can also be calculated 
using the sequences and weighting them with Q. This 
can be generalized to arbitrary marginals or statistics. 



APPENDIX F: SHARED SEQUENCES 




-10 12 3 
log(q.. L ) - universal donor 



FIG. 12: The saturation of the Pdata(Q) / Ppre(Q) ratio does 
not affect the inference of the model. We simulated a 
dataset from P pro and selected sequences with probability 
min[<5((?)/7, 1]. The plot compares the qi-L{a) selection fac- 
tors directly inferred from data (ordinate) to values inferred 
from such simulated data (blue dots: simulation). The scat- 
ter in these points is compared to the scatter obtained from 
learning the selection factors using a random subset of the 
data (red dots: sample). The size of the points denotes the 
probability Pi ; i,data(a) in the data repertoire. 



If we assume private models, the expected number of 
shared sequences between donors a and j3 is: 



(Fl) 



The number of shared sequences in a subset of donors 
is counted based on the nucleotide sequences. This em- 
pirical number can then be compared to two kinds of 
theoretical predictions. Either by assuming that the se- 
quences of each donor were generated and selected by a 
"private" model Ppo St , where a denotes the donor, i.e. a 
model inferred from the sequences of donor a; or by as- 
suming that sequences were generated and selected by a 
"common" or universal model Pp^L inferred from all se- 
quences together. The latter is justified by the fact that 
differences between private models are small, and could 
reflect spurious noise that would exaggerate differences 
between individuals. 



where N a and Np are the numbers of sequences in each 
donor dataset. To estimate that number, we collect se- 
quences that are shared between the generated datasets 
{£ a } of two (or more) donors, and reweight them by Q: 



£ Q (a) (p,V,J)Q W (p,V,J), (F2) 



NgNj 
M a Mg 

where M a and Mp are the number of generated sequences 
for each donor model, and where the sum is over the se- 
quences found in the {£ a } dataset of both donors. Similar 
equations are used for comparing more than two donors. 
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B. CC[q i:L (a), beta(a)] 



C. CC[8 ii i(o),turn(a)] 




-0.5 0 0.5 

Spearman's correlation 



FIG. 13: Correlation of the selection factors with several biochemical properties. Each panel shows the histogram, over all 
positions and lengths, of Spearman's correlation coefficient between the q^L (a) values for a given amino acid and the biochemical 
properties of that amino acid. The following biochemical properties are considered (from left to right, top to bottom): preference 
to appear in alpha helices (A), beta sheets (B), turns (C) (source for (A-C): Table 3.3 [17]). Residues that are exposed to solvent 
in protein-protein complexes (following definitions and data from [18]) are divided intothree groups: surface (interface) residues 
that have unchanged accessibility area when the interaction partner is present (D), rim (interface) residues that have changed 
accessibility area, but no atoms with zero accessibility in the complex (E) and core (interface) residues that have changed acces- 
sibility area and at least one atom with zero accessibility in the complex (F). Rim residues roughly correspond to the periphery 
of the interface region, and core residues correspond to the center. Finally we plot the basic biochemical amino acid proper- 
ties (source: http://en.wikipedia.org/wiki/Amino_acid and http://en.wikipedia.org/wiki/Proteinogenic_amino_acid): 
charge (G), pH (H), polarity (I), hydrophobicity (J) and volume (K). For all properties the actual numerical values used to 
calculate the correlations are listed in the inset tables. We see a positive correlation trend with turns and core residues and a 
negative correlation trend with the preference of amino acids to appear in alpha helices and volume. 
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If we assume a common model, the expected number 
of shared sequences reads: 



(F3) 



This can be estimated by: 
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t V" pW [Q(«) (p*^ v b , Jb)} 2 , (F4) FIG. 14: Model prediction (magenta) and observed (red) dis- 



M 



6=1 



where {£ a } are sequences generated from the mean VDJ 



(u) 

recombination model Pp ro . Similarly, the number of 
shared sequences between a triplet of donors a, j3, 7 is: 

N a N 0 N 7 



tributions of P pos t in the naive sequences that are shared be- 
tween at least three (left) or four (right) donors. The model 
discrepancy may be attributed to its failure to capture the 
very highly probable sequences. 



500 



M 



M 

E 

6=1 



[p^(e)] 2 [Q {u) (^:V b ,j b )}\ 



(F5) 



400 



and likewise for quadruplets and more. 

The expected numbers of shared sequences calculated 
above are averages. Their distribution is given by a Pois- 
son distribution of the same mean. We use these Poisson 
distribution to estimate the error bars in Fig. 6A and 
15A, as well as the distributions in Fig. 6B-C and S15B- 
C. 

If we assume a common model, sequences that are 
shared between at least n individuals are distributed ac- 
cording to oc [-Ppost]"- To explore the statistics of these 
sequences, we take our fP sequences generated from Pp"2 

and weigh them with [P^ci^W^iQ^H^T- For ex- 
ample, to estimate the distribution of logPp OSt in shared 
sequences as in Fig. 6D (for pairs), and Fig. 14 (for 
triplets and quadruplets), we calculate: 
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FIG. 15: Comparison between data and model for the number 
of shared sequences in the memory repertoires, in pairs (A) , 
triplets (B) and quadruplets (C) of individuals. 



Sampling from shared sequences is equivalent to sam- 
pling from the high-probability, large deviation regime of 
the distribution. This statement can be made more phys- 
ically intuitive by rewriting P pos t as a Boltzmann distri- 
bution e _B / T with T = 1 and E = — log P pos t- Consider- 
ing sequences observed in at least n donors, is equivalent 
to sampling from (l/Z(n))e~ nE (where Z(n) is a normal- 
isation constant), i.e. the Boltzmann distribution with 
T = 1/n. Sequences shared between more and more in- 
dividuals correspond to lower and lower temperatures, 
and thus lower energies and higher probabilities. In the 
low temperature regime, the roughness of the landscape 
depicted in Fig. 4C is starting to become important, and 
may not be well captured by our model, as suggested by 
Fig. 14. 



APPENDIX G: CODON MODEL 

It is reasonable to assume that selection acts on the 
protein structure, at the amino acid level. But each 
amino acid can be obtained using a number of differ- 
ent codons, which could in principle each have a differ- 
ent selection factor. We checked the robustness of our 
selection coefficients by learning an alternative model in 
which selection acts on codons. We present the results 
of this alternative codon model in Fig. 8 on the example 
of CDR3 sequences of length 12. We show the qi-i l (a) se- 
lection factors at each position for each amino acid, and 
compare them to the selection factors obtained for the 
codons coding for that amino acid. We see that, espe- 
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cially in the bulk of the CDR3 sequence, selection at the 
level of codons or amino acids are equivalent, proving the 
generality of our approach. 

APPENDIX H: ADDITIONAL EFFECTS OF 
SELECTION ON REPERTOIRE PROPERTIES 

In the main text we present several repertoire prop- 
erties, such as insertion profiles and comparisons of 
the Qi-L(o-) selection factors between naive and memory 
repertoires. In Fig. 11 we plot the deletion profiles for 
V, J and D-lefthand side and D-righthand side dele- 
tions, comparing the distributions for the pre-selection, 
naive and memory repertoires. We note that the deletion 
profiles for the V and J distributions are more peaked, 
favoring intermediate deletion values. However the D dis- 
tributions are little affected by selection. Similarly to the 
case of insertion distributions shown in in Fig. 3E-F, the 
naive and memory distributions appear indistinguishable 
within the error bars. 

In Fig. 3A-C, the selection factors <7i ; z,(a) acting on 
amino acids are compared between individuals and cell 
type. Similarly, the selection factors acting on the genes 
qvj are statistically indistinguishable between the mem- 
ory and naive repertoires for one individual, compared to 
the variability between the naive (or memory) repertoires 
taken from two sample individuals (see Fig. 9). 

To compare the repertoires of individuals as well as 
the naive and memory repertoires with each other, we 
consider the correlation coefficients between the selec- 
tion factors log qi-L, and between the VJ gene selection 
factor logqvj, of different individuals (Fig. 10). Correla- 
tions between memory and naive repertoires are similar 
to those between naive-naive or memory-memory reper- 
toires for different individuals; all are a bit smaller than 
the correlations between the artificial, shuffled sequence 
datasets, where the discrepancy is entirely attributable 
to statistical noise. These observations lead us to the 
conclusion that at this level of description, the selection 
processes that shape the memory and naive repertoires 
are very similar with each other and between different 
individuals. 



APPENDIX I: EFFECTS OF SATURATION OF 
GLOBAL SELECTION FACTORS ON THE 
INFERENCE PROCEDURE 

We consider distributions of the selection factor Q in 
the pre-selection ensemble P pr c(<3): m the post-selection 
ensemble according to the model -P p0 st(<9), and in the ac- 
tual data sequences -Pdata(Q)- These three distributions 
are formally defined as: 

1 M 

p p UQ) = ^$>[Q-Q(^,vk ■/*)]• ( n ) 

b=l 
1 M 

Ppost(Q) = T ^Q(p*,^,J b )<5[Q-Q(p*,n, MP) 

b=l 

= QPpre(Q). (K) 
1 - 

^data(Q) = E P PO S t(V a ,J a \& a ) 

a=l V a J a 

xSlQ-Qi^^Ja)] (14) 

As can be seen in Fig. 4, the ratio of the distribution 
of global selection factors -Pdata(Q)/-fpre(Q) saturates for 
large values of Q. To make sure that this saturation 
does not impair our ability to correctly infer the selection 
factors, we simulated a dataset from P prc and selected 
sequences with probability min[Q(o ; )/7, 1] to mimic the 
effects of this plateau. We then inferred the selection 
coefficients for this artificial dataset. We see that the 
saturation does not affect our ability to correctly infer 
the selection coefficients (Fig. 12) and the variability in 
the inferred q^L {a) selection factors is of the same order 
as from using random subsamples of the original data. 

APPENDIX J: BIOCHEMICAL CORRELATIONS 

To check for correlations of our inferred Qj ; i(a) selec- 
tion factors with known biochemical properties, we calcu- 
lated Spearman's coefficient between the selection factors 
and a number of standard quantities (see Fig. 13 for the 
full list) . We find that the selection factors do not corre- 
late well with most standard properties, such as charge, 
hydrophobicity and polarity. However we do find a trend 
of positive correlation with amino acids that are likely 
to appear in turns (Fig. 13 C) and ones that have been 
identified as those that make the core of the interface 
in a protein-protein complexes (Fig. 13 F) [18]. We find 
a trend of negative correlations with amino acids that 
have large volume (Fig. 13 K) and are likely to appear in 
alpha helices (Fig. 13 A). These observations are consis- 
tent with the fact that structurally CDR3 regions form 
loops and bulky amino acids as well as stabilizing alpha 
helix-like interactions would interfere with this structure. 
Core amino acids are at the center of the interface and 
are known to be the main contributors to interface recog- 
nition and affinity. On the other hand interface rim and 
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non-interface (surface) residues, which are both in touch interface forming elements, show similar non-distinctive 
to various degrees with the solvent and are not crucial correlation patterns. 
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